A simple sketching algorithm for entropy estimation
Authors
Abstract
We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream when space limitations make exact computation infeasible. It is known that α-dependent quantities such as the Rényi and Tsallis entropies can be estimated efficiently and without bias from low-dimensional α-stable data sketches. An approximation to the Shannon entropy can be obtained from either of these quantities by taking α sufficiently close to 1. However, practical guidelines for the choice of α are lacking. We avoid this problem by going directly to the limit. We show that the projection variables used in estimating the Rényi entropy can be transformed to have a proper distributional limit as α approaches 1. The Shannon entropy can then be estimated directly from a data sketch based on this limiting distribution. We derive properties of the distribution, showing that it has a surprisingly simple characteristic function, (iθ)^(iθ), and that the kth moment of the exponential of such a variable is k^k for all non-negative real values of k. These properties enable the Shannon entropy to be estimated directly from the associated data sketch as the logarithm of a simple average. We obtain the Fisher information for the statistical problem of recovering the entropy from the data sketch and hence a lower bound on the standard error of the estimated entropy. We show that our proposed estimator has a theoretical statistical efficiency of 96.8% and confirm this with an empirical study. Finally, we demonstrate that in order for the estimator to have 1+ε coverage with high probability the sketch must have size O(1/ε), in agreement with theoretical bounds.
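The log-of-average estimator described in the abstract can be illustrated with a minimal sketch. Several details here are assumptions not stated in the abstract. The projection variables are drawn with the Chambers-Mallows-Stuck method for the maximally skewed stable law with α = 1 and β = −1, with scale π/2 chosen so that the exponential of a draw has mean 1. The function names `sample_skewed_stable`, `sketch_entropy` and `exact_entropy` are hypothetical. The projection matrix is stored explicitly for a small alphabet, whereas a real streaming implementation would regenerate its entries pseudo-randomly. Under these assumptions each sketch entry divided by the stream length is distributed as a projection variable shifted by minus the entropy, so the negative logarithm of the average of the exponentials estimates the entropy in nats.

```python
import numpy as np


def sample_skewed_stable(rng, size):
    """Draw from the stable law with alpha = 1, beta = -1, scale pi/2.

    Assumption: Chambers-Mallows-Stuck generator, simplified for these
    parameter values. With this scaling E[exp(R)] = 1.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    v = np.pi / 2 - u
    return v * np.tan(u) + np.log(w * np.cos(u) / v)


def sketch_entropy(stream, num_items, sketch_size, seed=0):
    """Estimate the Shannon entropy (in nats) of a stream of item ids."""
    rng = np.random.default_rng(seed)
    # Illustration only: one stored row of projection variables per item.
    proj = sample_skewed_stable(rng, (num_items, sketch_size))
    y = np.zeros(sketch_size)  # the data sketch
    n = 0                      # stream length so far
    for item in stream:
        y += proj[item]        # linear update, one item at a time
        n += 1
    # y / n is distributed as R - H, and E[exp(R)] = 1, so the mean of
    # exp(y / n) estimates exp(-H).
    return float(-np.log(np.mean(np.exp(y / n))))


def exact_entropy(stream, num_items):
    """Exact empirical entropy (in nats), for comparison on small inputs."""
    counts = np.bincount(stream, minlength=num_items)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())
```

For example, calling `sketch_entropy(stream, 8, 4000)` on an integer array of item ids in the range 0 to 7 returns a value that can be compared with `exact_entropy(stream, 8)`. Because the sketch is linear in the counts, it needs only `sketch_size` numbers of working memory beyond the projection variables.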
Similar resources
A simple sketching algorithm for entropy estimation over streaming data
We consider the problem of approximating the empirical Shannon entropy of a high-frequency data stream under the relaxed strict-turnstile model, when space limitations make exact computation infeasible. An equivalent measure of entropy is the Rényi entropy that depends on a constant α. This quantity can be estimated efficiently and without bias from a low-dimensional synopsis called an α-stable da...
Sketching and Streaming High-Dimensional Vectors
A sketch of a dataset is a small-space data structure supporting some prespecified set of queries (and possibly updates) while consuming space substantially sublinear in the space required to actually store all the data. Furthermore, it is often desirable, or required by the application, that the sketch itself be computable by a small-space algorithm given just one pass over the data, a so-call...
ADAPTIVE NEURO FUZZY INFERENCE SYSTEM BASED ON FUZZY C–MEANS CLUSTERING ALGORITHM, A TECHNIQUE FOR ESTIMATION OF TBM PENETRATION RATE
Estimating the tunnel boring machine (TBM) penetration rate is a crucial and complex task frequently encountered in mechanized tunnel excavation. Estimating the machine penetration rate may reduce the risks related to the high capital costs typical of excavation operations. Thus establishing a relationship between rock properties and TBM pe...
Sketching for Nearfield Acoustic Imaging of Heavy-Tailed Sources
We propose a probabilistic model for acoustic source localization with known but arbitrary geometry of the microphone array. The approach has several features. First, it relies on a simple nearfield acoustic model for wave propagation. Second, it does not require the number of active sources. On the contrary, it produces a heat map representing the energy of a large set of candidate locations, ...
An Advanced State Estimation Method Using Virtual Meters
Power system state estimation is a central component of the energy management systems of a power system. The goal of state estimation is to determine the system status and the power flow on transmission lines. This paper presents an advanced state estimation algorithm based on the weighted least squares (WLS) criterion by introducing virtual meters. For each bus of the network, except the slack bus, a virtual meter...